# Mass Imputation

* Execution: *banff.massimp()*
* SDE function type: *Review, Selection, Treatment*
* Input status flags: *None*
* Output status flags: *IMAS*

## Description

Performs donor imputation for a block of variables using a nearest neighbour approach or random selection.

The `massimp` procedure is intended for use when a large block of variables is missing for a set of respondents, typically when detailed information is collected only for a subsample (or second phase sample) of units. While the `donorimp` procedure uses both system and user matching fields, mass imputation only considers user matching fields to find a valid record (donor) that is most similar to the one which needs imputation (recipient).

Mass imputation considers a recipient any record for which all the variables to impute (`must_impute`) are missing on `indata`, and considers a donor any record for which none of the listed variables are missing. If matching fields (`must_match`) are provided by the user, the `massimp` procedure uses them to find the nearest donor using the same distance function as `donorimp`. If matching fields are not provided, a donor is selected at random.
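
The recipient and donor definitions above can be illustrated with a minimal sketch. It is not Banff code: the `classify` helper, the sample records, and the use of `None` to represent a missing value are assumptions made for illustration only.

```python
def classify(record, must_impute):
    """Return 'recipient', 'donor', or 'neither' for one record (a dict)."""
    # Assumption: a missing value is represented by None.
    missing = [record.get(name) is None for name in must_impute]
    if all(missing):
        return "recipient"   # every must_impute variable is missing
    if not any(missing):
        return "donor"       # no must_impute variable is missing
    return "neither"         # partially missing: meets neither definition


must_impute = ["revenue_q1", "revenue_q2"]
records = [
    {"id": "A", "revenue_q1": 10.0, "revenue_q2": 12.0},
    {"id": "B", "revenue_q1": None, "revenue_q2": None},
    {"id": "C", "revenue_q1": 5.0, "revenue_q2": None},
]
roles = {r["id"]: classify(r, must_impute) for r in records}
print(roles)
```

In this sketch, record `C` is missing only some of the `must_impute` variables, so it meets neither definition.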

Unlike `donorimp`, the `massimp` procedure does not use edits. Before running the procedure, users should ensure that the pool of potential donors does not include any errors, including outliers or consistency errors.

Users may create by-groups by specifying `by` variables. These by-groups act as imputation classes. Use the `min_donors` and `percent_donors` parameters to ensure that each imputation class contains an appropriate number or percentage of donors before imputation is performed.
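
As a rough illustration of how these two thresholds might act on an imputation class, consider the sketch below. It rests on assumptions that this document does not confirm: that the percentage is computed as donors over donors plus recipients, that both thresholds must be met, and that the comparisons are inclusive. The `class_is_eligible` helper is hypothetical.

```python
def class_is_eligible(n_donors, n_recipients, min_donors=30, percent_donors=30.0):
    """Hypothetical check of one imputation class against both thresholds."""
    total = n_donors + n_recipients
    if total == 0:
        return False
    # Assumption: percentage of donors among donors and recipients combined.
    donor_pct = 100.0 * n_donors / total
    # Assumption: both conditions must hold, with inclusive comparisons.
    return n_donors >= min_donors and donor_pct >= percent_donors


print(class_is_eligible(40, 60))    # 40 donors, 40% of the class
print(class_is_eligible(20, 10))    # too few donors
print(class_is_eligible(30, 100))   # enough donors, but only about 23% of the class
```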

For a full mathematical description of the procedure methods, with examples, please see the [Functional Description](../Banff%20Functional%20Description.pdf).

## Input and output tables

Descriptions of input and output tables are given below. Banff supports a number of input and output formats; please see the Banff User Guide for more information.   

| Input Table | Description |
| ------------| ----------- |
| indata      | Input statistical data. Mandatory. |

| Output Table | Description |
| -------------- | ----------- |
| outdata      | Output statistical table containing imputed data. <br><br> Note that outdata will only contain successfully imputed records and affected fields. Users should update indata with the values from outdata before continuing the data editing process. |
| outstatus    | Output status file identifying imputed fields with IMAS status flags, and their values after imputation. |
| outdonormap  | Output table of recipient-donor pairs for successfully imputed records. |

For details on the content of output tables, please see the [Output Tables](../output_tables.md) document.

## Parameters

| Parameter        | Python type  | Description                 | 
| ---------------- | -------------| --------------------------- |
| unit_id          | str          | Identify key variable (unit identifier) on indata. Mandatory. <br><br> Must be unique for each record. Records with a missing value are dropped before processing. |
| must_impute      | str          | Variable(s) to be imputed. Mandatory. <br><br> To be a recipient, all the variables listed in `must_impute` must be missing for an observation. If all these variables are non-missing, then this observation is a potential donor.<br><br> Example: `must_impute="revenue_q1 revenue_q2 revenue_q3 revenue_q4"`|
| must_match       | str          | User defined matching field(s).<br><br> `must_match` is optional; when it is not used, the `random` parameter must be specified and a donor will be selected randomly for each recipient.<br><br> Example: `must_match="revenue headcount"`|
| random           | bool         | Random selection of donors.<br><br> When `random` is used alongside `must_match`, random selection will be applied to recipients with missing values for all `must_match` fields. |
| min_donors       | int          | Minimum number of donors required to perform imputation. Default=30. |
| percent_donors   | float        | Minimum percentage of donors required to perform imputation. Default=30. |
| n_limit          | int          | Maximum number of times a donor can be used. |
| mrl              | float        | Multiplier ratio limit. <br><br> This parameter is multiplied by the ratio of the number of recipients to the number of donors; the result is the maximum number of times a donor can be used. |
| seed             | int          | Specify the root for the random number generator. <br><br> The seed is used to ensure consistent results from one run to the next. If not specified or specified as a non-positive value, a random number is generated by the procedure.  |
| accept_negative  | bool         | Treat negative values as valid. Default=False. <br><br> By default, a positivity edit is added for every variable in the list of edits; this parameter permits users to remove this restriction. If required, users may directly add positivity edits for individual variables. |
| by               | str          | Variable(s) used to partition indata into by-groups for independent processing. <br><br> In massimp, by-groups can also be seen as imputation classes. <br><br> Example: `by = "province industry"` |
| presort          | bool         | Sorts input tables before processing, according to procedure requirements. Default=True. |
| no_by_stats      | bool         | Reduces log output by suppressing by-group specific messages. Default=False. |

## Notes

### Nearest neighbour or random donor

The parameters `must_match` and `random` determine whether the nearest-neighbour algorithm or random selection is used to select donors. The following table shows how specifying these parameters affects mass imputation.

| `must_match` specified | `random` specified   | Syntax    | Imputation                                       | 
| -------------| ----------- | ----------| -----------------------------------------------  |
| No           | No          | Incorrect | Results in an error, no imputation is performed. |
| No           | Yes         | Correct   | Random selection of donors.                      |
| Yes          | No          | Correct   | Nearest neighbour selection using `must_match` variables. |
| Yes          | Yes         | Correct   | Nearest neighbour selection using `must_match` variables, or random selection for recipients with missing values for all `must_match` variables. |

If a recipient has missing values for some but not all `must_match` variables, the distance to the closest donor will be based only on the `must_match` variables that have valid values. If a recipient has missing values for all `must_match` variables, then it will be randomly matched to a donor if the `random` parameter is specified, and it will not be matched to any donor if the `random` parameter is not used.
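
The table and paragraph above can be summarized as a small decision function. This is a sketch only: `selection_method` is a hypothetical helper, `None` stands in for a missing value, and the distance calculation itself is not reproduced.

```python
def selection_method(recipient, must_match, random):
    """Return (method, usable matching fields) for one recipient.

    recipient  -- dict of field name to value, None meaning missing (assumption)
    must_match -- list of matching field names, possibly empty
    random     -- True when the random parameter is specified
    """
    if not must_match and not random:
        # First row of the table: incorrect syntax, nothing is imputed.
        raise ValueError("specify must_match, random, or both")
    # Only matching fields with valid values contribute to the distance.
    usable = [f for f in must_match if recipient.get(f) is not None]
    if usable:
        return "nearest neighbour", usable
    if random:
        return "random", []
    return "not imputed", []


partial = {"revenue": 100.0, "headcount": None}
empty = {"revenue": None, "headcount": None}
fields = ["revenue", "headcount"]

print(selection_method(partial, fields, random=False))
print(selection_method(empty, fields, random=True))
print(selection_method(empty, fields, random=False))
```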

### Multiple equivalent solutions

In some cases, for a given recipient, there may be multiple equidistant donors (i.e. having the same distance from the recipient). Because `massimp` does not use edits, each of these donors is an equally valid solution. When this occurs, the procedure selects one of these solutions at random.

For development or testing purposes, users may wish to produce consistent results over multiple runs of the procedure, and may do so using the `seed` parameter. It ensures that the same solutions will be selected from one run to the next, if executed on the same set of inputs. Note that if `seed` is not specified, the system generates a default seed.

The `seed` parameter can also be used to replicate results when random donor selection is performed.
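
The effect of a fixed seed on tie-breaking can be sketched as follows. Python's `random` module is used here as a stand-in and is not the generator used by the procedure; `pick_donor` and the sample distances are hypothetical.

```python
import random


def pick_donor(distances, seed):
    """Pick one donor among those tied for the smallest distance."""
    rng = random.Random(seed)  # stand-in generator, seeded for repeatability
    best = min(distances.values())
    # Sort the tied donors so the choice depends only on the seed.
    tied = sorted(d for d, dist in distances.items() if dist == best)
    return rng.choice(tied)


distances = {"D1": 0.2, "D2": 0.5, "D3": 0.2}
first = pick_donor(distances, seed=1234)
second = pick_donor(distances, seed=1234)
print(first, second)
```

With the same seed and the same inputs, both calls return the same donor, which is always one of the two tied donors `D1` and `D3`.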

### Limiting the repeated use of donors

Users may limit the repeated use of donors with the interrelated `n_limit` and `mrl` parameters. The donor limit is calculated as follows, depending on whether one or both are specified:

| `n_limit` | `mrl`  | Donor Limit                                        | 
| --------- | -------| ------------------------------------------------   |
| No        | No     | Number of times a donor can be used is unlimited.  |
| No        | Yes    | round(`mrl`*(recipients/donors)).                  |
| Yes       | No     | `n_limit`.                                         |
| Yes       | Yes    | round(max(`n_limit`,`mrl`*(recipients/donors))).   |

When donor use is limited with the `n_limit` parameter, the number of donors still available may fall below `min_donors` during imputation. In this case, mass imputation continues, because `min_donors` is validated only once, at the beginning. The same applies to `percent_donors`.
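
The donor limit table above can be written as a short function. This is a sketch under stated assumptions: `donor_limit` and `round_half_up` are hypothetical helpers, and rounding halves upward is an assumption, since the table only says `round`.

```python
import math


def round_half_up(x):
    """Round to the nearest integer, halves upward (assumed rounding rule)."""
    return math.floor(x + 0.5)


def donor_limit(n_recipients, n_donors, n_limit=None, mrl=None):
    """Return the maximum number of uses per donor, or None if unlimited."""
    if n_limit is None and mrl is None:
        return None                      # neither specified: unlimited
    if mrl is None:
        return n_limit                   # only n_limit specified
    scaled = mrl * (n_recipients / n_donors)
    if n_limit is None:
        return round_half_up(scaled)     # only mrl specified
    return round_half_up(max(n_limit, scaled))  # both specified


# 100 recipients and 40 donors give a ratio of 2.5.
print(donor_limit(100, 40))                      # unlimited
print(donor_limit(100, 40, n_limit=2))           # n_limit only
print(donor_limit(100, 40, mrl=1.5))             # 1.5 * 2.5 = 3.75
print(donor_limit(100, 40, n_limit=2, mrl=1.5))  # max(2, 3.75)
```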